Improving Simultaneous Machine Translation with Monolingual Data
Authors
Abstract
Simultaneous machine translation (SiMT) is usually done via sequence-level knowledge distillation (Seq-KD) from a full-sentence neural machine translation (NMT) model. However, there is still a significant performance gap between NMT and SiMT. In this work, we propose to leverage monolingual data to improve SiMT, which trains a SiMT student on the combination of bilingual data and external monolingual data distilled by Seq-KD. Preliminary experiments on En-Zh and En-Ja news domain corpora demonstrate that monolingual data can significantly improve translation quality (e.g., +3.15 BLEU on En-Zh). Inspired by the behavior of human simultaneous interpreters, we propose a novel monolingual sampling strategy for SiMT, considering both chunk length and monotonicity. Experimental results show that our sampling strategy consistently outperforms the random sampling strategy (and other conventional typical strategies), avoiding the key problem of SiMT -- hallucination, and has better scalability. We achieve +0.72 BLEU improvements on average against random sampling on En-Zh and En-Ja. Data and codes can be found at https://github.com/hexuandeng/Mono4SiMT.
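The abstract's sampling strategy ranks Seq-KD distilled monolingual data by chunk length and monotonicity. As a rough, hypothetical sketch only (not the authors' released implementation; the function names, the alignment format, and the length cap are all assumptions), a monotonicity score can be computed from word-alignment positions, preferring sentence pairs whose target follows source order:

```python
# Hypothetical sketch of monotonicity-based sampling for Seq-KD
# distilled monolingual data. Alignments are (src_idx, tgt_idx) pairs,
# e.g. from an external word aligner. Names and thresholds are
# illustrative assumptions, not the paper's actual implementation.

def monotonicity(alignment):
    """Fraction of concordant alignment pairs: 1.0 means the target
    follows source order exactly (ideal for simultaneous translation)."""
    pairs = sorted(alignment)               # order by source position
    tgt = [t for _, t in pairs]
    n = len(tgt)
    if n < 2:
        return 1.0
    concordant = sum(
        1 for i in range(n) for j in range(i + 1, n) if tgt[i] <= tgt[j]
    )
    return concordant / (n * (n - 1) / 2)

def select_samples(candidates, k, max_len=40):
    """Rank distilled (source, target, alignment) triples, preferring
    monotone alignments; chunk-length control is folded into a crude
    character-length cap here."""
    scored = [
        (monotonicity(align), src, tgt)
        for src, tgt, align in candidates
        if max(len(src), len(tgt)) <= max_len
    ]
    scored.sort(reverse=True)
    return [(src, tgt) for _, src, tgt in scored[:k]]
```

A fully monotone alignment such as `[(0, 0), (1, 1), (2, 2)]` scores 1.0, while a fully reordered one such as `[(0, 2), (1, 1), (2, 0)]` scores 0.0, so reordering-heavy pairs are sampled last.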
Similar resources
Improving Neural Machine Translation Models with Monolingual Data
Neural Machine Translation (NMT) has obtained state-of-the-art performance for several language pairs, while only using parallel data for training. Monolingual data plays an important role in boosting fluency for phrase-based statistical machine translation, and we investigate the use of monolingual data for NMT. In contrast to previous work, which integrates a sepa...
Improving Statistical Machine Translation with Monolingual Collocation
This paper proposes to use monolingual collocations to improve Statistical Machine Translation (SMT). We make use of the collocation probabilities, which are estimated from monolingual corpora, in two aspects, namely improving word alignment for various kinds of SMT systems and improving phrase table for phrase-based SMT. The experimental results show that our method improves the performance of...
Machine Translation - 09: Monolingual Data
4.2.1 Balancing the LM and TM In order for the decoder to flexibly balance the input from the LM and TM, we augment the decoder with a “controller” mechanism. The need to flexibly balance the signals arises depending on the work being translated. For instance, in the case of Zh-En, there are no Chinese words that correspond to articles in English, in which case the LM may be more informative. O...
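The "controller" mechanism described in this excerpt balances the language-model (LM) and translation-model (TM) signals depending on the input. A minimal sketch of one plausible form, a learned scalar gate interpolating the two next-word distributions (the logistic gate, the dictionary-shaped distributions, and all names here are illustrative assumptions, not the source's actual design):

```python
# Hypothetical sketch of a gating "controller" that balances LM and TM
# next-word distributions. The logistic form and names are assumptions.
import math

def controller_gate(h, w, b):
    """Scalar gate g in (0, 1) computed from decoder state h; a simple
    logistic regression stands in for the learned controller."""
    z = sum(wi * hi for wi, hi in zip(w, h)) + b
    return 1.0 / (1.0 + math.exp(-z))

def mix_distributions(p_lm, p_tm, g):
    """Interpolate LM and TM word distributions: g weights the LM,
    so a larger g lets the LM dominate (e.g. for English articles
    with no Chinese counterpart in Zh-En)."""
    return {w: g * p_lm.get(w, 0.0) + (1.0 - g) * p_tm.get(w, 0.0)
            for w in set(p_lm) | set(p_tm)}
```

With a zero-initialized gate the two models are weighted equally; training would move the gate toward the more informative signal per context.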
MonoTrans: Statistical Machine Translation from Monolingual Data
We present MonoTrans, a statistical machine translation system which only uses monolingual source language and target language data, without using any parallel corpora or language-specific rules. It translates each source word by the most similar target word, according to a combination of a string similarity measure and a word frequency similarity measure. It is designed for translation between...
Improving Translation Model by Monolingual Data
We use target-side monolingual data to extend the vocabulary of the translation model in statistical machine translation. This method called “reverse self-training” improves the decoder’s ability to produce grammatically correct translations into languages with morphology richer than the source language esp. in small-data setting. We empirically evaluate the gains for several pairs of European ...
Journal
Journal title: Proceedings of the ... AAAI Conference on Artificial Intelligence
Year: 2023
ISSN: 2159-5399, 2374-3468
DOI: https://doi.org/10.1609/aaai.v37i11.26497